Frequency lists
Counting words and lemmas

The following frequency lists count distinct orthographic words, including inflected forms. For example, the verb "be" is represented by "is", "are", "were", "be", etc.

English

TV and movie scripts

Most common words in TV and movie scripts:

Here are frequency lists comparable to the Gutenberg ones, but based on 29,213,800 words from TV and movie scripts and transcripts. A fuller explanation of how the list was generated, and of its limitations, is at Wiktionary:Frequency lists/TV/2006/explanation.

Here are the top 100 words (from TV scripts) in alphabetical order:
:a · about · all · and · are · as · at · back · be · because · been · but · can · can't · come · could · did · didn't · do · don't · for · from · get · go · going · good · got · had · have · he · her · here · he's · hey · him · his · how · I · if · I'll · I'm · in · is · it · it's · just · know · like · look · me · mean · my · no · not · now · of · oh · OK · okay · on · one · or · out · really · right · say · see · she · so · some · something · tell · that · that's · the · then · there · they · think · this · time · to · up · want · was · we · well · were · what · when · who · why · will · with · would · yeah · yes · you · your · you're

Here they are in frequency order:
:1-1000 · 1001-2000 · 2001-3000 · 3001-4000 · 4001-5000 · 5001-6000 · 6001-7000 · 7001-8000 · 8001-9000 · 9001-10000

From the 10,000th to the 40,000th:
:10001-12000 · 12001-14000 · 14001-16000 · 16001-18000 · 18001-20000 · 20001-22000 · 22001-24000 · 24001-26000 · 26001-28000 · 28001-30000 · 30001-32000 · 32001-34000 · 34001-36000 · 36001-38000 · 38001-40000
:40001-41284 (the dregs that were tied for 40,000th place)

That will probably be all. These lists cover a third of all the unique words; the rest were used 5 or fewer times each.
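As a rough illustration of the counting rule described under "Counting words and lemmas", the sketch below counts distinct orthographic forms (so "is" and "are" stay separate entries) and cuts the list at a given rank while keeping every word tied with the last one. That is one way a 40,000-word cutoff can end up listing 41,284 words. The tokenizing pattern, the lowercasing, and the function names are assumptions made for this example; they are not the procedure actually used to build the lists.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally with internal
# apostrophes, so "can't" and "he's" stay whole. The real lists may differ.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def count_words(text):
    """Count distinct orthographic words; inflected forms are NOT merged."""
    return Counter(token.lower() for token in TOKEN_RE.findall(text))


def top_with_ties(counts, rank):
    """Return (word, count) pairs up to `rank`, plus words tied with the last one."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if len(ordered) <= rank:
        return ordered
    cutoff = ordered[rank - 1][1]  # count of the word sitting at position `rank`
    return [(word, n) for word, n in ordered if n >= cutoff]


sample = "I can't be there. He is here, they are here, and we were there."
counts = count_words(sample)
print(counts["is"], counts["are"], counts["were"], counts["be"])
print(top_with_ties(counts, 2))
```

Sorting by descending count and then alphabetically keeps the order of tied words stable, which makes the cutoff reproducible between runs.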
Project Gutenberg

Most common words in Project Gutenberg:

These lists give the most frequent words found by a simple, straight (obvious) frequency count of all the books on Project Gutenberg. The list of books was downloaded in July 2005 and synchronized with rsync monthly thereafter. These are mostly English words, with some other languages represented to a lesser extent. Many Project Gutenberg books are scanned once their copyright expires, typically editions published before 1923, so the language does not represent modern usage. For example, "thy" is listed as the 253rd most common word. Also, the text of the Project Gutenberg boilerplate warning appears in each of the 24,000+ books, which inflates the counts of the words it contains.

Here are the top 100 words (from Project Gutenberg texts) in alphabetical order:
:a · about · after · all · and · any · an · are · as · at · been · before · be · but · by · can · could · did · down · do · first · for · from · good · great · had · has · have · her · he · him · his · if · into · in · is · its · it · I · know · like · little · made · man · may · men · me · more · Mr · much · must · my · not · now · no · of · on · one · only · or · other · our · out · over · said · see · she · should · some · so · such · than · that · the · their · them · then · there · these · they · this · time · to · two · upon · up · us · very · was · were · we · what · when · which · who · will · with · would · you · your

*These wikified terms can be copied to other-language Wiktionaries; that is what they are intended for. If you do, please add an interwiki link on this page.
:New list as of 4/16/2006:
*Wiktionary:Frequency lists/PG/2006/04/1-10000
*Wiktionary:Frequency lists/PG/2006/04/10001-20000
*Wiktionary:Frequency lists/PG/2006/04/20001-30000
*Wiktionary:Frequency lists/PG/2006/04/30001-40000
:New list as of 10/10/2005:
*Wiktionary:Frequency lists/PG/2005/10/1-10000
:The same list divided into blocks of a thousand words:
:1-1000 1001-2000 2001-3000 3001-4000 4001-5000 5001-6000 6001-7000 7001-8000 8001-9000 9001-10000
:more to come...
:::Older lists
:Most common words, in order of rank:
*Wiktionary:Frequency lists/Project Gutenberg 1-10000
*Wiktionary:Frequency lists/Project Gutenberg 10001-20000
*Wiktionary:Frequency lists/Project Gutenberg 20001-30000
*Wiktionary:Frequency lists/Project Gutenberg 30001-40000
*Wiktionary:Frequency lists/Project Gutenberg 40001-50000
*Wiktionary:Frequency lists/Project Gutenberg 50001-60000
*Wiktionary:Frequency lists/Project Gutenberg 60001-70000
*Wiktionary:Frequency lists/Project Gutenberg 70001-80000
*Wiktionary:Frequency lists/Project Gutenberg 80001-90000
*Wiktionary:Frequency lists/Project Gutenberg 90001-100000
::Approximately 24,197 files, 1,712,082,956 words, and 70,756.0 average words/file, from which about 9,053,310 unique "words" were gleaned.
*Starting from the straight frequency count, every word that already has an entry in the current copy of Wiktionary was removed from the list. Even entries that are only redirects were removed.
:# Wiktionary:Frequency lists/Project Gutenberg undefined 1-1000
*With somewhat different filtering/selection criteria:
:# Wiktionary: Frequency Lists/Project Gutenberg undefined B 1-1000
*The latest version can always be found at:
: User:Connel MacKenzie/Gutenberg

Contemporary fiction

The 2,000 most common words in contemporary fiction can be found here:
*Wiktionary:Frequency lists/Contemporary fiction.
The 2,000 most common words in contemporary fiction can be found here divided into 60 subject categories.
*Wiktionary:Frequency lists/Contemporary fiction in 60 categories.
This list groups regular inflected forms under a single lemma, unlike most of these lists.

Contemporary poetry

The 2,000 most common words in contemporary poetry can be found here:
*Wiktionary:Frequency lists/Contemporary poetry.
Another lemma-based list.

Top English words lists

*Category:100 English basic words
*Category:200 English basic words
*Category:1000 English basic words
*/Complete Shakespeare wordlist/ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Word families

* British National Corpus - most frequent word families: see the simple:Wiktionary:BNC spoken freq on Simple English Wiktionary.
* Academic Word List by word family: see the simple:Wiktionary:Academic word list on Simple English Wiktionary.

Czech

* /Czech

Dutch

The thirteen most popular Dutch words:

From Max Havelaar (numbers in parentheses denote occurrences):
# de (4770)
# en (2709)
# het, 't (2469)
# van (2259)
# ik (1999)
# te (1935)
# dat (1875)
# die (1807)
# in (1639)
# een (1637)
# hij (1328)
# niet (1162)
# zijn (1049)

Frequency of diacritic characters in Dutch:

From diacritic characters in the Dutch language:

French

Frequency lists from http://wortschatz.uni-leipzig.de/html/wliste.html, used with the authorization of the laboratory.
*top 2000 words
*Wiktionary:French frequency lists/2001-4000
*Wiktionary:French frequency lists/4001-6000
*Wiktionary:French frequency lists/6001-8000
*Wiktionary:French frequency lists/8001-10000
Note: these indicative lists still require some cleanup, because:
# they don't unify common words that are normally not capitalized in the dictionary, but can be capitalized at the beginning of sentences or in titles;
# they do not correctly split off words that are contracted with an apostrophe and attached to the following word: very common articles (l’), prepositions (d’), the negation adverb (n’), and pronouns (c’, j’, l’, m’, s’, t’). The same applies to verbal liaison particles (-t-, -z-, which are not really words, as they have no meaning and are written only for phonetic reasons) and to subject pronouns placed just after the verb (after a mandatory linking hyphen, which does not form a compound word but marks the inversion of the subject rather than the normal occurrence of an object). All these words should be counted separately;
# the source is certainly drawn only from Belgian French written papers, with occurrences typical of that country and with no equivalent in France or in other French-speaking countries, where these words are used much more rarely (such as currency abbreviations and Belgian toponyms for regions and cities), and with many terms missing for specialties that are very common in France;
# the list contains isolated letters that are not words per se (except for a few real words: a, à, y);
# there are also acronyms and symbols that occur only in written documents, not in the spoken language;
# frequent proper names are included, but they are not very specific to any of the 4 studied languages.
This list does not unify inflected words (nouns or adjectives with a plural or feminine mark, or conjugated verbs), and it does not recognize the auxiliaries of verbs in compound tenses as part of the conjugated verb, but treats auxiliaries separately for each inflected form.
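The first two cleanup points in the note above (sentence-initial capitals, elided words, and inverted subject pronouns) can be sketched in code. This is a minimal illustration under stated assumptions, not the tokenizer used for the Leipzig lists: the pronoun list, the choice to drop the liaison particle -t- entirely, and the rule of lowercasing the first token of each sentence (which would also lowercase a sentence-initial proper name) are all choices made for this example.

```python
import re

# Elided forms named in the note: l', d', n', c', j', m', s', t'.
ELIDED = ("l", "d", "n", "c", "j", "m", "s", "t")
# Assumption: subject pronouns that may follow the verb after a hyphen.
SUBJECT_PRONOUNS = ("je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles", "ce")
# Runs of letters, possibly joined by hyphens or (straight or curly) apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def tokenize_french(sentence):
    """Split one sentence into countable words (hypothetical helper)."""
    tokens = []
    for raw in WORD_RE.findall(sentence):
        word = raw.replace("’", "'")
        # Split an elided article, preposition, adverb, or pronoun off the front.
        head, sep, rest = word.partition("'")
        if sep and rest and head.lower() in ELIDED:
            tokens.append(head.lower() + "'")
            word = rest
        # Split verb-subject inversion such as "va-t-il" into "va" and "il".
        parts = word.split("-")
        if len(parts) > 1 and parts[-1].lower() in SUBJECT_PRONOUNS:
            verb_parts = parts[:-1]
            if len(verb_parts) > 1 and verb_parts[-1].lower() == "t":
                verb_parts = verb_parts[:-1]  # drop the liaison particle -t-
            tokens.append("-".join(verb_parts))
            tokens.append(parts[-1])
        else:
            tokens.append(word)
    if tokens:
        tokens[0] = tokens[0].lower()  # undo sentence-initial capitalization
    return tokens


print(tokenize_french("L’homme n’est pas là."))
print(tokenize_french("Où va-t-il aujourd’hui ?"))
```

Words such as "aujourd’hui" are left whole because the part before the apostrophe is not one of the elided forms listed in the note.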
Galician

* /Galician

German

German words in Wikipedia:
* Wiktionary:Frequency lists/top 2000 German Wikipedia words
:See also the 100, 1000, or 10 000 most frequent words.

Top 2000 German words from subtitles:
* 1-1000
* 1001-2000

Hungarian

Top 100,000 words in Hungarian text: http://mokk.bme.hu/resources/webcorpus
/Hungarian frequency list 1-10000/

Icelandic

Icelandic verbs:
* The 100 most frequent Icelandic verbs according to the verb webpage.
* /Icelandic verb frequency list 1-100/

Italian

Top 1000 Italian words from subtitles:
* 1-1000

Korean

Top 200 Korean words:
*Korean 200

Polish

Top 200 Polish words:
*List of top 200 Polish words

Russian

* List of top 1000 Russian words

Slovene

50 most frequent Slovene words, Primož Jakopin research:
je, in, se, v, da, na, so, ne, pa, ki, bi, za, z, ni, sem, ga, še, po, s, tako, ko, tudi, to, bil, ali, si, mu, od, bilo, kot, že, iz, kaj, bo, če, vse, bila, kakor, mi, pri, jo, kar, jih, sta, o, do, ti, kako, samo, me
* 5.2.3 Primož Jakopin: Zgornja meja entropije pri leposlovnih besedilih v slovenskem jeziku

Spanish

Top 10000 Spanish words from subtitles:
* 1-1000
* -2000
* -3000
* -4000
* -5000
* -6000
* -7000
* -8000
* -9000
* -10000

Swedish

*Wiktionary:Frequency lists/top 2000 Swedish Wikipedia words
*/Swedish (similar, but not identical)

Thai

* Appendix:100 basic Thai words
:: If this is just "basic" words, not statistically the "most frequent" words, it shouldn't be here, it should be in the Appendix namespace only.
--Connel MacKenzie 20:59, 26 December 2006 (UTC)

Turkish

* List of top 1000 Turkish words

Yiddish

* Top 600 Yiddish words

Yiddish in other Wiktionaries:
* French -
* Finnish -
* Latin -
* Tamil

External links

*Russian words with pictures and grammar - with English translations
*1000 most common Russian words - with English translations
*Word Frequency List of Chilean Spanish - (Lifcach), Scott Sadowsky & Ricardo Martínez Gamboa

The Word Frequency List of Chilean Spanish (Lifcach) is a set of 102 frequency lists derived from the sub-corpora of the Corpus Dinámico del Castellano de Chile (Dynamic Corpus of Chilean Spanish, Codicach), a corpus of contemporary written Chilean Spanish developed by Sadowsky between 1997 and 2002; this corpus contained approximately 450 million words when the Lifcach was created (it currently contains some 800 million words). The Lifcach also contains a non-weighted list of total frequencies (the Total Occurrences column), which is simply the sum of the frequencies of the 102 individual lists (in other words, the list of frequencies of the entire Codicach corpus).

es:Wikcionario:Palabras más frecuentes del español
cy:Wiciadur:Rhestri amlder geiriau
fi:Wikisanakirja:Frequency lists/PG/2006/04/1-10000
fr:Wiktionnaire:Listes de fréquence
Latin
ru:Приложение:Рейтинги частотности слов
sv:Wiktionary:Projekt/Frekvensordlista
Tamil
vi:Wiktionary:Bảng tần số
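The Lifcach entry under External links describes its Total Occurrences column as the plain sum of the 102 sub-corpus frequency lists. A minimal sketch of that non-weighted sum follows; the function name and the toy data are hypothetical and stand in for the real lists, whose file format is not described here.

```python
from collections import Counter


def total_occurrences(sub_corpus_lists):
    """Sum per-sub-corpus frequencies into one non-weighted total per word."""
    total = Counter()
    for frequencies in sub_corpus_lists:
        total.update(frequencies)  # Counter.update adds counts instead of replacing them
    return total


# Hypothetical toy data: three small lists instead of the 102 real ones.
lists = [
    {"de": 120, "la": 95, "que": 80},
    {"de": 60, "que": 40, "chile": 12},
    {"la": 30, "chile": 5},
]
totals = total_occurrences(lists)
print(totals.most_common(2))
```

Because the sum is not weighted, a large sub-corpus contributes more to the total than a small one, which is why the result equals the frequency list of the whole corpus.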